fix: use PyArrowFileIO for S3 access in get_dbt_model_as_dataframe#2218

Merged
rachellougee merged 3 commits into main from fix/pyiceberg-pyarrow-fileio-hang
May 11, 2026
@rachellougee rachellougee commented May 11, 2026

What are the relevant tickets?

NA

These dagster runs hang on production

https://pipelines.odl.mit.edu/locations/lakehouse/jobs/instructor_onboarding_daily_job/runs
https://pipelines.odl.mit.edu/assets/reporting/student_risk_probability?view=events

Description (What does it do?)

Problem

After the aiobotocore upgrade from 3.4.0 → 3.5.0 (introduced in #2178), Dagster runs that call get_dbt_model_as_dataframe hung indefinitely.
Root cause: aiobotocore's async event loop thread was populating botocore's lazy-loader cache, which blocked all pending S3 coroutines. No timeout fired and no error was raised, so the runs stalled silently.

Fix

Switch GlueCatalog to use PyArrowFileIO (pyiceberg.io.pyarrow.PyArrowFileIO) instead of the default FsspecFileIO. PyArrow uses a native C++ S3 client and bypasses aiobotocore entirely, eliminating the hang.
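
A minimal sketch of the configuration change described above, not the exact diff in glue_helper.py. It assumes the FileIO implementation is selected through PyIceberg's `py-io-impl` catalog property. The helper names `glue_catalog_properties` and `load_glue_catalog` and the catalog name `"default"` are hypothetical.

```python
# Sketch only: helper names and the catalog name are illustrative assumptions,
# not taken from glue_helper.py.
PYARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"


def glue_catalog_properties(extra=None):
    """Build catalog properties that pin S3 access to PyArrowFileIO."""
    props = {
        "type": "glue",
        # Force PyArrow's native S3 client instead of the fsspec/aiobotocore path.
        "py-io-impl": PYARROW_FILE_IO,
    }
    if extra:
        props.update(extra)
        # Re-assert the FileIO choice so callers cannot silently override it.
        props["py-io-impl"] = PYARROW_FILE_IO
    return props


def load_glue_catalog(name="default", extra=None):
    """Load a Glue catalog using the properties above (requires pyiceberg)."""
    # Imported lazily so the property helper stays usable without pyiceberg.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **glue_catalog_properties(extra))
```

Keeping the properties in one helper makes the FileIO choice easy to assert in a unit test, even though the hang itself only reproduces in a live environment.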

How can this be tested?

The hang only reproduces in a live Dagster environment with aiobotocore 3.5.0+, so it cannot be fully covered by a local docker compose up. To verify manually:

  1. Deploy this branch to the QA Dagster environment
  2. Trigger any asset that calls get_dbt_model_as_dataframe (e.g. a lakehouse asset that reads an Iceberg table via Glue)
  3. Confirm the run completes without hanging (previously it would stall indefinitely on S3 reads)
  4. Check that stack traces in the Dagster run logs contain no FsspecFileIO or aiobotocore references
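
The log check in step 4 could be scripted roughly as follows, assuming the run logs have been exported as plain text. The function name `find_suspect_lines` is hypothetical, and the pattern simply matches the two identifiers named above.

```python
import re

# Identifiers that should no longer appear in stack traces after the fix.
SUSPECT_PATTERN = re.compile(r"FsspecFileIO|aiobotocore", re.IGNORECASE)


def find_suspect_lines(log_text):
    """Return (line_number, line) pairs that mention the old S3 access path."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if SUSPECT_PATTERN.search(line)
    ]
```

An empty result for a completed run is consistent with the fix, though it does not by itself prove that the hang is gone.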

Additional Context

Copilot AI review requested due to automatic review settings May 11, 2026 16:51
Copilot AI left a comment

Pull request overview

This PR addresses production Dagster runs hanging during S3 reads when get_dbt_model_as_dataframe loads Iceberg tables via AWS Glue, by switching PyIceberg’s S3 access away from the default fsspec/aiobotocore path to PyArrow’s native S3 client.

Changes:

  • Configure GlueCatalog to use pyiceberg.io.pyarrow.PyArrowFileIO for S3 access in get_dbt_model_as_dataframe.
  • Update the function’s docstring to reflect pl.LazyFrame semantics and document the aiobotocore-related hang context.

Comment thread packages/ol-orchestrate-lib/src/ol_orchestrate/lib/glue_helper.py Outdated
Comment thread packages/ol-orchestrate-lib/src/ol_orchestrate/lib/glue_helper.py
@rachellougee rachellougee merged commit 4a31285 into main May 11, 2026
6 checks passed
@rachellougee rachellougee deleted the fix/pyiceberg-pyarrow-fileio-hang branch May 11, 2026 18:12
rachellougee added a commit that referenced this pull request May 12, 2026